
    Energy efficient run-time mapping and thread partitioning of concurrent OpenCL applications on CPU-GPU MPSoCs

    Heterogeneous Multi-Processor Systems-on-Chips (MPSoCs) containing CPU and GPU cores are typically required to execute applications concurrently. However, as will be shown in this paper, existing approaches are not well suited for concurrent applications, as they were developed either by considering only a single application or without exploiting both CPU and GPU cores at the same time. In this paper, we propose an energy-efficient run-time mapping and thread partitioning approach for executing concurrent OpenCL applications on both CPU and GPU cores while satisfying performance requirements. Depending upon the performance requirements, for each concurrently executing application, the mapping process finds the appropriate number of CPU cores and operating frequencies of CPU and GPU cores, and the partitioning process identifies an efficient partitioning of the applications' threads between CPU and GPU cores. We validate the proposed approach experimentally on the Odroid-XU3 hardware platform with various mixes of applications from the Polybench benchmark suite. Additionally, a case study is performed with a real-world application, SLAMBench. Results show an average energy saving of 32% compared to existing approaches while still satisfying the performance requirements.
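    The thread-partitioning step described above can be sketched as splitting an OpenCL kernel's global work size between the CPU and GPU according to a chosen ratio. The function name and rounding policy below are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch: split an OpenCL NDRange between CPU and GPU devices.
# The GPU share is rounded down to a multiple of the work-group size, since
# each sub-range's global size must be divisible by its local work-group size.

def partition_work(global_size, gpu_ratio, workgroup_size=64):
    """Split `global_size` work-items into (cpu_items, gpu_items)."""
    gpu_items = int(global_size * gpu_ratio) // workgroup_size * workgroup_size
    cpu_items = global_size - gpu_items
    return cpu_items, gpu_items

cpu_items, gpu_items = partition_work(10000, 0.7)
```

    The two sub-ranges would then be enqueued on separate command queues (one per device), with a `global_work_offset` on the second so the index spaces do not overlap.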

    Mapping parallel programs to heterogeneous multi-core systems

    Heterogeneous computer systems are ubiquitous in all areas of computing, from mobile to high-performance computing. They promise to deliver increased performance at lower energy cost than purely homogeneous, CPU-based systems. In recent years GPU-based heterogeneous systems have become increasingly popular. They combine a programmable GPU with a multi-core CPU. GPUs have become flexible enough to handle not only graphics workloads but also various kinds of general-purpose algorithms. They are thus used as a coprocessor or accelerator alongside the CPU. Developing applications for GPU-based heterogeneous systems involves several challenges. Firstly, not all algorithms are equally suited to GPU computing. It is thus important to carefully map the tasks of an application to the most suitable processor in a system. Secondly, current frameworks for heterogeneous computing, such as OpenCL, are low-level, requiring a thorough understanding of the hardware by the programmer. This high barrier to entry could be lowered by automatically generating and tuning this code from a high-level, and thus more user-friendly, programming language. Both challenges are addressed in this thesis. For the task mapping problem, this thesis presents a machine learning-based approach. It combines static features of the program code with runtime information on input sizes to predict the optimal mapping of OpenCL kernels. This approach is further extended to also take contention on the GPU into account. Both methods are able to outperform competing mapping approaches by a significant margin. Furthermore, this thesis develops a method for targeting GPU-based heterogeneous systems from OpenMP, a directive-based framework for parallel computing. OpenMP programs are translated to OpenCL and optimized for GPU performance. At runtime a predictive model decides whether to execute the original OpenMP code on the CPU or the generated OpenCL code on the GPU. This approach is shown to outperform both a competing approach and hand-tuned code.
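    The mapping idea described above, combining static code features with runtime input sizes to predict the best device, can be sketched with a toy nearest-neighbour classifier. The feature set and training points below are invented for illustration; the thesis builds a proper learned model:

```python
# Hypothetical sketch: predict CPU vs. GPU for a kernel from a feature
# vector (e.g. compute ops, memory ops, input size) using 1-nearest-neighbour
# over labelled training kernels. All numbers are invented examples.
import math

def predict_device(features, training_set):
    """training_set: list of (feature_tuple, 'CPU' | 'GPU') pairs.
    Returns the label of the nearest training point (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(training_set, key=lambda t: dist(features, t[0]))[1]

training = [
    ((100, 10, 1_000_000), 'GPU'),  # compute-heavy kernel, large input
    ((10, 100, 1_000), 'CPU'),      # memory-bound kernel, small input
]
```

    In practice the features would need normalising so that the large input-size dimension does not dominate the distance, and a richer model (e.g. a decision tree or SVM) would replace the single neighbour.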

    MOCL: An Efficient OpenCL Implementation for the Matrix-2000 Architecture

    This paper presents the design and implementation of an Open Computing Language (OpenCL) framework for the Matrix-2000 many-core architecture. This architecture is designed to replace the Intel Xeon Phi accelerators of the TianHe-2 supercomputer. We share our experience and insights on how to design an effective OpenCL system for this new hardware accelerator. We propose a set of new analyses and optimizations to unlock the potential of the hardware. We extensively evaluate our approach using a wide range of OpenCL benchmarks on single and multiple computing nodes. We present our design choices and provide guidance on how to optimize code for the new Matrix-2000 architecture.

    Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems

    General-purpose GPU-based systems are highly attractive, as they offer potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This article presents a compiler-based approach to automatically generate optimized OpenCL code from data-parallel OpenMP programs for GPUs. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures, and uses machine learning to build a predictive model that determines whether it is worthwhile running the OpenCL code on the GPU or the OpenMP code on the multicore host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on distinct GPU-based systems. We achieved average speedups of 4.51× and 4.20× (up to 143× and 67×) on Core i7/NVIDIA GeForce GTX580 and Core i7/AMD Radeon 7970 platforms, respectively, over a sequential baseline. Our approach achieves, on average, greater than 10× speedups over two state-of-the-art automatic GPU code generators.
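    The "average" speedups reported above are aggregates over per-benchmark speedups. This sketch shows the two usual ways such averages are computed; whether the article uses the arithmetic or geometric mean is an assumption here, so both are shown with invented numbers:

```python
# Two common ways to aggregate per-benchmark speedups into a single figure.
# The sample values are invented; they are not the article's data.
import math

def arithmetic_mean(speedups):
    return sum(speedups) / len(speedups)

def geometric_mean(speedups):
    # Less sensitive to a single outlier benchmark than the arithmetic mean.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

sample = [2.0, 8.0]  # hypothetical per-benchmark speedups
```

    The geometric mean is usually preferred for speedup ratios because one extreme benchmark (such as a 143× outlier) cannot dominate the summary.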

    Post hoc immunostaining of GABAergic neuronal subtypes following in vivo two-photon calcium imaging in mouse neocortex

    GABAergic neurons in the neocortex are diverse with regard to morphology, physiology, and axonal targeting pattern, indicating functional specializations within the cortical microcircuitry. Little information is available, however, about functional properties of distinct subtypes of GABAergic neurons in the intact brain. Here, we combined in vivo two-photon calcium imaging in supragranular layers of the mouse neocortex with post hoc immunohistochemistry against the three calcium-binding proteins parvalbumin, calretinin, and calbindin in order to assign subtype marker profiles to neuronal activity. Following coronal sectioning of fixed brains, we matched cells in corresponding volumes of image stacks acquired in vivo and in fixed brain slices. In GAD67-GFP mice, more than 95% of the GABAergic cells could be unambiguously matched, even in large volumes comprising more than a thousand interneurons. Triple immunostaining revealed a depth-dependent distribution of interneuron subtypes, with increasing abundance of PV-positive neurons with depth. Most importantly, the triple-labeling approach was compatible with previous in vivo calcium imaging following bulk loading of Oregon Green 488 BAPTA-1, which allowed us to classify spontaneous calcium transients recorded in vivo according to the neurochemically defined GABAergic subtypes. Moreover, we demonstrate that post hoc immunostaining can also be applied to wild-type mice expressing the genetically encoded calcium indicator Yellow Cameleon 3.60 in cortical neurons. Our approach is a general and flexible method to distinguish GABAergic subtypes in cell populations previously imaged in the living animal. It should thus facilitate dissecting the functional roles of these subtypes in neural circuitry.

    Reliable mapping and partitioning of performance-constrained OpenCL Applications on CPU-GPU MPSoCs

    Heterogeneous Multi-Processor Systems-on-Chips (MPSoCs) containing CPU and GPU cores are typically required to execute applications concurrently. Existing approaches exploit applications executing on CPU and GPU cores at the same time, taking performance and energy consumption into account for mapping and partitioning. This paper proposes a mapping and partitioning approach for applications on CPU-GPU MPSoCs that takes the temperature behavior of the system into account. We use temperature profiling to partition the applications between CPU and GPU; the profiling is done by measuring the temperature of the CPU and GPU cores while executing different applications at different partitions. Results show up to 13% savings in the average temperature of the chip while maintaining performance requirements. A lower thermal profile translates into better long-term reliability (lifetime) of the SoC.
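    The selection step described above, picking a partitioning from profiled temperature measurements subject to a performance constraint, can be sketched as a simple filter-then-minimize. The data layout and numbers below are invented for illustration:

```python
# Hypothetical sketch: given profiled (gpu_ratio, avg_temp_c, exec_time_s)
# measurements for candidate partitionings, pick the coolest one that still
# meets the performance requirement (deadline). All values are invented.

def coolest_feasible(profiles, deadline):
    """Return the gpu_ratio with the lowest average temperature among
    partitionings whose execution time meets the deadline, or None."""
    feasible = [p for p in profiles if p[2] <= deadline]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p[1])[0]

profiles = [
    (1.0, 72.0, 1.0),  # all-GPU: fastest but hottest
    (0.5, 61.0, 1.4),  # balanced: cooler, slower
    (0.0, 55.0, 3.0),  # all-CPU: coolest, slowest
]
```

    The trade-off the abstract describes falls out directly: relaxing the deadline lets the selector move toward cooler, more CPU-heavy partitionings.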

    OpenCL task partitioning in the presence of GPU contention

    Heterogeneous multi- and many-core systems are increasingly prevalent in the desktop and mobile domains. On these systems it is common for programs to compete with co-running programs for resources. While multi-task scheduling for CPUs is a well-studied area, how to partition and map computing tasks onto a heterogeneous system in the presence of GPU contention (i.e. multiple programs competing for the GPU) remains an open problem. In this paper we consider the problem of partitioning OpenCL kernels on a CPU-GPU based system in the presence of contention on the GPU. We propose a machine learning-based approach that predicts the optimal partitioning of OpenCL kernels, explicitly taking GPU contention into account. Our predictive model achieves a speed-up of 1.92 over a scheme that always uses the GPU. When compared to two state-of-the-art dynamic approaches, our model achieves speed-ups of 1.54 and 2.56, respectively.
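    The intuition behind contention-aware partitioning can be sketched as shrinking the GPU's share of the work as more co-runners compete for it. The scaling rule below is an invented heuristic, not the paper's learned model:

```python
# Hypothetical sketch: re-balance a CPU/GPU work split under GPU contention.
# Assumes the contended GPU's effective throughput is divided evenly among
# co-running programs while CPU throughput is unaffected (an invented model).

def gpu_share_under_contention(base_share, co_runners):
    """base_share: GPU fraction that is optimal with no contention (0..1).
    co_runners: number of other programs competing for the GPU.
    Returns the re-balanced GPU fraction of the work."""
    effective_gpu = base_share / (1 + co_runners)  # GPU slowed by sharing
    cpu = 1.0 - base_share                         # CPU throughput unchanged
    return effective_gpu / (effective_gpu + cpu)

share = gpu_share_under_contention(0.8, 1)
```

    With no co-runners the split is unchanged; each additional competitor pushes more work toward the CPU, which is exactly the behaviour a contention-aware predictor must learn.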

    A workload-aware mapping approach for data-parallel programs

    Much compiler-oriented work in the area of mapping parallel programs to parallel architectures has ignored the issue of external workload. Given that the majority of platforms will not be dedicated to just one task at a time, the impact of other jobs needs to be addressed. As mapping is highly dependent on the underlying machine, a technique that is easily portable across platforms is also desirable. In this paper we develop an approach for predicting the optimal number of threads for a given data-parallel application in the presence of external workload. We achieve 93.7% of the maximum available speedup, which gives an average speedup of 1.66 on 4 cores, 1.24 times better than the OpenMP compiler's default policy. We also develop an alternative cooperative model that minimizes the impact on external workload while still giving an improved average speedup. Finally, we evaluate our approach on a separate 8-core machine, achieving an average 1.33 times speedup over the default policy and demonstrating the portability of our approach.
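    The cooperative idea described above can be sketched as choosing a thread count from the core count and the observed external load. The rule below is an invented heuristic; the paper instead learns this mapping with a portable predictive model:

```python
# Hypothetical sketch: pick a thread count for a data-parallel region that
# leaves room for external workload. A simple heuristic stand-in for the
# paper's learned predictor.

def pick_threads(num_cores, external_load):
    """external_load: number of runnable threads belonging to other jobs.
    Leave cores for the external workload, but always use at least one."""
    return max(1, num_cores - external_load)

threads = pick_threads(4, 1)
```

    In OpenMP this choice would be applied via `omp_set_num_threads` (or a `num_threads` clause) before entering the parallel region, re-evaluated as the external load changes.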